Predicting speculation: a simple disambiguation approach to hedge detection in biomedical literature
Background: This paper presents a novel approach to the problem of hedge detection, which involves identifying so-called hedge cues for labeling sentences as certain or uncertain. This is the classification problem for Task 1 of the CoNLL-2010 Shared Task, which focuses on hedging in the biomedical domain. We here propose to view hedge detection as a simple disambiguation problem, restricted to words that have previously been observed as hedge cues. As the feature space for the classifier is still very large, we also perform experiments with dimensionality reduction using the method of random indexing.
Results: The SVM-based classifiers developed in this paper achieve the best published results so far for sentence-level uncertainty prediction on the CoNLL-2010 Shared Task test data. We also show that the technique of random indexing can be successfully applied for reducing the dimensionality of the original feature space by several orders of magnitude, without sacrificing classifier performance.
Conclusions: This paper introduces a simplified approach to detecting speculation or uncertainty in text, focusing on the biomedical domain. Evaluated at the sentence level, our SVM-based classifiers achieve the best published results so far. We also show that the feature space can be aggressively compressed using random indexing while still maintaining comparable classifier performance.
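As an illustration of the random indexing idea mentioned in this abstract, here is a minimal sketch, not the authors' implementation. It assumes the common formulation in which each original feature is assigned a sparse random index vector with a few +1/-1 entries, and a high-dimensional feature vector is reduced by summing the index vectors of its active features. The function names, the dimensionality, and the number of non-zero entries are all hypothetical choices.

```python
import numpy as np

def random_index_matrix(n_features, dim, nonzeros, seed=0):
    """Build one sparse ternary index vector per original feature.

    Each row has exactly `nonzeros` entries drawn from {-1, +1};
    all other entries are zero. (Illustrative parameters only.)
    """
    rng = np.random.default_rng(seed)
    R = np.zeros((n_features, dim))
    for i in range(n_features):
        positions = rng.choice(dim, size=nonzeros, replace=False)
        R[i, positions] = rng.choice([-1.0, 1.0], size=nonzeros)
    return R

def reduce_features(X, R):
    """Project feature vectors (rows of X) into the reduced space.

    Each row of the result is the weighted sum of the index vectors
    of the features that are active in the corresponding row of X.
    """
    return X @ R
```

The reduced matrix can then be passed to any classifier in place of the original features; whether performance is preserved depends on the chosen dimensionality and has to be checked empirically.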
Temporal dynamics of semantic relations in word embeddings: an application to predicting armed conflict participants
This paper deals with using word embedding models to trace the temporal
dynamics of semantic relations between pairs of words. The set-up is similar to
the well-known analogies task, but expanded with a time dimension. To this end,
we apply incremental updating of the models with new training texts, including
incremental vocabulary expansion, coupled with learned transformation matrices
that let us map between members of the relation. The proposed approach is
evaluated on the task of predicting insurgent armed groups based on
geographical locations. The gold standard data for the time span 1994–2010 is
extracted from the UCDP Armed Conflicts dataset. The results show that the
method is feasible and outperforms the baselines, but also that important work
still remains to be done. Comment: to appear in EMNLP 2017 proceedings.
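The notion of a learned transformation matrix that maps between the two members of a relation can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not say how the matrices are trained, so the sketch fits the map by ordinary least squares and retrieves the prediction by cosine similarity. The function names `fit_linear_map` and `nearest` are hypothetical.

```python
import numpy as np

def fit_linear_map(X, Y):
    """Fit W minimising ||X @ W - Y|| by least squares.

    X holds source-side embeddings (e.g. locations), one per row;
    Y holds the embeddings of the related target words, row-aligned.
    """
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return W

def nearest(query, candidates):
    """Return the row index of the candidate most cosine-similar to query."""
    q = query / np.linalg.norm(query)
    C = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    return int(np.argmax(C @ q))
```

In use, a new source vector `x` would be projected with `x @ W`, and `nearest` would pick the closest word in the target vocabulary as the predicted relation member.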
Random Indexing Re-Hashed
Proceedings of the 18th Nordic Conference of Computational Linguistics
NODALIDA 2011.
Editors: Bolette Sandford Pedersen, Gunta Nešpore and Inguna Skadiņa.
NEALT Proceedings Series, Vol. 11 (2011), 224-229.
© 2011 The editors and contributors.
Published by Northern European Association for Language Technology (NEALT), http://omilia.uio.no/nealt
Electronically published at Tartu University Library (Estonia), http://hdl.handle.net/10062/16955
Redefining part-of-speech classes with distributional semantic models
This paper studies how word embeddings trained on the British National Corpus
interact with part of speech boundaries. Our work targets the Universal PoS tag
set, which is currently actively being used for annotation of a range of
languages. We experiment with training classifiers for predicting PoS tags for
words based on their embeddings. The results show that the information about
PoS affiliation contained in the distributional vectors allows us to discover
groups of words with distributional patterns that differ from other words of
the same part of speech.
This data often reveals hidden inconsistencies of the annotation process or
guidelines. At the same time, it supports the notion of 'soft' or 'graded' part
of speech affiliations. Finally, we show that information about PoS is
distributed among dozens of vector components, not limited to only one or two
features.
Transfer and Multi-Task Learning for Noun-Noun Compound Interpretation
In this paper, we empirically evaluate the utility of transfer and multi-task
learning on a challenging semantic classification task: semantic interpretation
of noun–noun compounds. Through a comprehensive series of experiments and
in-depth error analysis, we show that transfer learning via parameter
initialization and multi-task learning via parameter sharing can help a neural
classification model generalize over a highly skewed distribution of relations.
Further, we demonstrate how dual annotation with two distinct sets of relations
over the same set of compounds can be exploited to improve the overall accuracy
of a neural classifier and its F1 scores on the less frequent, but more
difficult relations. Comment: EMNLP 2018: Conference on Empirical Methods in Natural Language
Processing (EMNLP)
Evaluating Semantic Vectors for Norwegian
In this article, we present two benchmark data sets for evaluating models of semantic word similarity for Norwegian. While such resources are available for English, they did not exist for Norwegian prior to this work. Furthermore, we produce large-coverage semantic vectors trained on the Norwegian Newspaper Corpus using several popular word embedding frameworks. Finally, we demonstrate the usefulness of the created resources for evaluating the performance of different word embedding models on the tasks of analogical reasoning and synonym detection. The benchmark data sets and word embeddings are all made freely available.
Improving cross-domain dependency parsing with dependency-derived clusters
This paper describes a semi-supervised approach to improving statistical dependency parsing using dependency-based word clusters. After applying a baseline parser to unlabeled text, clusters are induced using K-means with word features based on the dependency structures. The parser is then re-trained using information about the clusters, yielding improved parsing accuracy on a range of different data sets, including WSJ and the English Web Treebank. We report improved results using both in-domain and out-of-domain data, and also include a comparison with using n-gram-based Brown clustering.